Abstract

Topic models provide a useful method for dimensionality reduction and exploratory data
analysis in large text corpora. Most approaches to topic model inference have been based on
a maximum likelihood objective. Efficient algorithms exist that approximate this objective,
but they have no provable guarantees. Recently, algorithms have been introduced that provide
provable bounds, but these algorithms are not practical because they are inefficient and not
robust to violations of model assumptions. In this paper we present an algorithm for topic model
inference that is both provable and practical. The algorithm produces results comparable to the
best MCMC implementations while running orders of magnitude faster.

1 Introduction

Topic modeling is a popular method that learns thematic structure from large document collections
without human supervision. The model is simple: documents are mixtures of topics, which are
modeled as distributions over a vocabulary [Blei, 2012]. Each word token is generated by selecting a topic
from a document-specific distribution, and then selecting a specific word from that topic-specific
distribution. Posterior inference over document-topic and topic-word distributions is intractable:
in the worst case it is NP-hard even for just two topics [Arora et al., 2012b]. As a result, researchers
have used approximate inference techniques such as singular value decomposition [Deerwester et al.,
1990], variational inference [Blei et al., 2003], and MCMC [Griffiths & Steyvers, 2004].

Recent work in theoretical computer science focuses on designing provably efficient algorithms 
for topic modeling. These treat the topic modeling problem as one of statistical recovery: assuming 
the data was generated perfectly from the hypothesized model using an unknown set of parameter 
values, the goal is to recover the model parameters in polynomial time given a reasonable number
of samples. 

Arora et al. [2012b] present an algorithm that provably recovers the parameters of topic models 
provided that the topics meet a certain separability assumption [Donoho & Stodden, 2003] . Separa- 
bility requires that every topic contains at least one anchor word that has non-zero probability only 
in that topic. If a document contains this anchor word, then it is guaranteed that the corresponding 
topic is among the set of topics used to generate the document. The algorithm proceeds in two 
steps: first it selects anchor words for each topic; and second, in the recovery step, it reconstructs 
topic distributions given those anchor words. The input for the algorithm is the second-order 
moment matrix of word-word co-occurrences. 

Anandkumar et al. [2012] present a provable algorithm based on third-order moments that does 
not require separability, but, unlike the algorithm of Arora et al., assumes that topics are not 
correlated. Although standard topic models like LDA [Blei et al., 2003] assume that the choice 
of topics used to generate the document are uncorrelated, there is strong evidence that topics are 
dependent [Blei & Lafferty, 2007, Li & McCallum, 2007]: economics and politics are more likely to
co-occur than economics and cooking. 

Both algorithms run in polynomial time, but the bounds that have been proven on their sample 
complexity are weak and their empirical runtime performance is slow. The algorithm presented by 
Arora et al. [2012b] solves numerous linear programs to find anchor words. Bittorf et al. [2012] 
and Gillis [2012] reduce the number of linear programs needed. All of these algorithms infer top- 
ics given anchor words using matrix inversion, which is notoriously unstable and noisy: matrix 
inversion frequently generates negative values for topic-word probabilities. 

In this paper we present three contributions. First, we replace linear programming with a combi- 
natorial anchor selection algorithm. So long as the separability assumption holds, we prove that this 
algorithm is stable in the presence of noise and thus has polynomial sample complexity for learning 
topic models. Second, we present a simple probabilistic interpretation of topic recovery given an- 
chor words that replaces matrix inversion with a new gradient-based inference method. Third, we 
present an empirical comparison between recovery-based algorithms and existing likelihood-based 
topic inference. We study both the empirical sample complexity of the algorithms on synthetic 
distributions and the performance of the algorithms on real-world document corpora. We find that 
our algorithm performs as well as collapsed Gibbs sampling on a variety of metrics, and runs at 
least an order of magnitude faster. 

Our algorithm both inherits the provable guarantees from Arora et al. [2012a, b] and also results 
in simple, practical implementations. We view our work as a step toward bridging the gap between 
statistical recovery approaches to machine learning and maximum likelihood estimation, allowing 
us to circumvent the computational intractability of maximum likelihood estimation yet still be 
robust to model error. 

2 Background 

We consider the learning problem for a class of admixture distributions that are frequently used 
for probabilistic topic models. Examples of such distributions include latent Dirichlet allocation 
[Blei et al., 2003], correlated topic models [Blei & Lafferty, 2007], and Pachinko allocation [Li & 
McCallum, 2007]. We denote the number of words in the vocabulary by V and the number of 
topics by K. Associated with each topic k is a multinomial distribution over the words in the
vocabulary, which we will denote as the column vector $A_k$ of length V. Each of these topic models
postulates a particular prior distribution $\tau$ over the topic distribution of a document. For example,
in latent Dirichlet allocation (LDA) $\tau$ is a Dirichlet distribution, and for the correlated topic model
$\tau$ is a logistic Normal distribution. The generative process for a document d begins by drawing
the document's topic distribution $W_d \sim \tau$. Then, for each position i we sample a topic assignment
$z_i \sim W_d$, and finally a word $w_i \sim A_{z_i}$.
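The generative process above can be sketched in a few lines of Python. This is a minimal illustration for the LDA case only, where the prior is a Dirichlet. The helper names (`sample_dirichlet`, `sample_categorical`, `generate_document`) and the toy topics are assumptions made for this example and do not come from the paper.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, rng):
    """Draw an index with probability given by probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(topics, alpha, length, rng):
    """topics[k][i] plays the role of A_{i,k}: probability of word i under topic k."""
    w_d = sample_dirichlet(alpha, rng)                     # W_d ~ tau (Dirichlet prior)
    words = []
    for _ in range(length):
        z = sample_categorical(w_d, rng)                   # z_i ~ W_d
        words.append(sample_categorical(topics[z], rng))   # w_i ~ A_{z_i}
    return words

# Hypothetical toy model: V = 4 words, K = 2 topics.
rng = random.Random(0)
topics = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
doc = generate_document(topics, alpha=[0.5, 0.5], length=20, rng=rng)
```

Replacing `sample_dirichlet` with a draw from a logistic Normal would give the correlated topic model instead; the rest of the process is unchanged.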

We can combine the column vectors $A_k$ for each of the K topics to obtain the word-topic
matrix A of dimension V x K. We can similarly combine the column vectors $W_d$ for M documents to
obtain the topic-document matrix W of dimension K x M. We emphasize that W is unknown and
stochastically generated: we can never expect to be able to recover it. The learning task that we
consider is to find the word-topic matrix A. For the case when $\tau$ is Dirichlet (LDA), we also show
how to learn hyperparameters of $\tau$.

Maximum likelihood estimation of the word-topic distributions is NP-hard even for two topics 
[Arora et al., 2012b], and as a result researchers typically use approximate inference. The most 
popular approaches are variational inference [Blei et al., 2003], which optimizes an approximate
objective, and Markov chain Monte Carlo [McCallum, 2002], which asymptotically samples from 
the posterior distribution but has no guarantees of convergence. 

Arora et al. [2012b] present an algorithm that provably learns the parameters of a topic model 
given samples from the model, provided that the word-topic distributions satisfy a condition called 
separability: 

Definition 2.1. The word-topic matrix A is p-separable for $p > 0$ if for each topic k, there is some
word i such that $A_{i,k} \ge p$ and $A_{i,k'} = 0$ for $k' \ne k$.
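Definition 2.1 translates directly into a check on a known word-topic matrix. The sketch below is illustrative only: the function names and the toy matrix are assumptions, and in practice A is unknown, so this check applies to synthetic or already-learned matrices.

```python
def anchor_words(A, p):
    """Map each topic k to the words i with A[i][k] >= p and A[i][k'] == 0 for k' != k.

    A is a list of V rows (words), each with K entries (topics).
    """
    K = len(A[0])
    anchors = {k: [] for k in range(K)}
    for i, row in enumerate(A):
        nonzero = [k for k, value in enumerate(row) if value > 0]
        if len(nonzero) == 1 and row[nonzero[0]] >= p:
            anchors[nonzero[0]].append(i)
    return anchors

def is_separable(A, p):
    """True if every topic has at least one anchor word with probability >= p."""
    return all(anchor_words(A, p).values())

# Hypothetical toy matrix: V = 4 words, K = 2 topics, columns sum to 1.
A = [[0.4, 0.0],
     [0.3, 0.2],
     [0.3, 0.3],
     [0.0, 0.5]]
```

Here word 0 is an anchor for topic 0 and word 3 is an anchor for topic 1, so this A is 0.4-separable but not 0.45-separable.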

Such a word is called an anchor word because when it occurs in a document, it is a perfect 
indicator that the document is at least partially about the corresponding topic, since there is no 
other topic that could have generated the word. Suppose that each document is of length $D \ge 2$, and
let $R = \mathbb{E}_\tau[W W^T]$ be the K x K topic-topic covariance matrix. Let $\alpha_k$ be the expected proportion
of topic k in a document generated according to $\tau$. The main result of Arora et al. [2012b] is:

Theorem 2.2. There is a polynomial time algorithm that learns the parameters of a topic model 
if the number of documents is at least 

where p is defined above, $\gamma$ is the condition number of R, and $a = \max_{k,k'} \alpha_k / \alpha_{k'}$. The algorithm
learns the word-topic matrix A and the topic-topic covariance matrix R up to additive error $\epsilon$.

Unfortunately, this algorithm is not practical. Its running time is prohibitively large because it 
solves V linear programs, and its use of matrix inversion makes it unstable and sensitive to noise. 
In this paper, we will give various reformulations and modifications of this algorithm that alleviate 
these problems altogether. 

3 A Probabilistic Approach to Exploiting Separability 

The Arora et al. [2012b] algorithm has two steps: anchor selection, which identifies anchor words, 
and recovery, which recovers the parameters of A and of $\tau$. Both anchor selection and recovery
take as input the matrix Q (of size V x V) of word-word co-occurrence counts, whose construction 
is described in the supplementary material. Q is normalized so that the sum of all entries is 1. 
The high-level flow of our complete learning algorithm is described in Algorithm 1, and follows the 
same two steps. In this section we will introduce a new recovery method based on a probabilistic 






Algorithm 1. High Level Algorithm

Input: Textual corpus $\mathcal{D}$, Number of anchors K, Tolerance parameters $\epsilon_a, \epsilon_b > 0$.
Output: Word-topic matrix A, topic-topic matrix R

  $Q \leftarrow$ Word Co-occurrences($\mathcal{D}$)
  Form $\{\bar{Q}_1, \bar{Q}_2, \ldots, \bar{Q}_V\}$, the normalized rows of Q.
  $S \leftarrow$ FastAnchorWords($\{\bar{Q}_1, \bar{Q}_2, \ldots, \bar{Q}_V\}$, K, $\epsilon_a$) (Algorithm 4)
  $A, R \leftarrow$ RecoverKL(Q, S, $\epsilon_b$) (Algorithm 3)
  return A, R

Algorithm 2. Original Recover [Arora et al., 2012b]

Input: Matrix Q, Set of anchor words S
Output: Matrices A, R

  Permute rows and columns of Q
  Compute $p_S = Q_S \mathbf{1}$
  Solve for z: $Q_{S,S}\, z = p_S$
  Solve for $A^T = (Q_{S,S}\, \mathrm{Diag}(z))^{-1} Q_S^T$
  Solve for $R = \mathrm{Diag}(z)\, Q_{S,S}\, \mathrm{Diag}(z)$
  return A, R



framework. We defer the discussion of anchor selection to the next section, where we provide a 
purely combinatorial algorithm for finding the anchor words. 

The original recovery procedure (which we call "Recover") from Arora et al. [2012b] is as follows. 
First, it permutes the Q matrix so that the first K rows and columns correspond to the anchor 
words. We will use the notation $Q_S$ to refer to the first K rows, and $Q_{S,S}$ for the first K rows and just
the first K columns. If constructed from infinitely many documents, Q would be the second-order
moment matrix $Q = \mathbb{E}[A W W^T A^T] = A\, \mathbb{E}[W W^T]\, A^T = A R A^T$, with the following block structure:

$$Q = A R A^T = \begin{pmatrix} D \\ U \end{pmatrix} R \begin{pmatrix} D & U^T \end{pmatrix} = \begin{pmatrix} DRD & DRU^T \\ URD & URU^T \end{pmatrix}$$

where D is a diagonal matrix of size K x K. Next, it solves for A and R using the algebraic 
manipulations outlined in Algorithm 2. 

The use of matrix inversion in Algorithm 2 results in substantial imprecision in the estimates 
when we have small sample sizes. The returned A and R matrices can even contain small negative 
values, requiring a subsequent projection onto the simplex. As we will show in Section 5, the origi- 
nal recovery method performs poorly relative to a likelihood-based algorithm. Part of the problem 
is that the original recover algorithm uses only K rows of the matrix Q (the rows for the anchor 
words), whereas Q is of dimension V x V. Besides ignoring most of the data, this has the additional 
complication that it relies completely on co-occurrences between a word and the anchors, and this 
estimate may be inaccurate if both words occur infrequently. 

Here we adopt a new probabilistic approach, which we describe below after introducing some
notation. Consider any two words in a document and call them $w_1$ and $w_2$, and let $z_1$ and $z_2$ refer
to their topic assignments. We will use $A_{i,k}$ to index the matrix of word-topic distributions, i.e.
$A_{i,k} = p(w_1 = i \mid z_1 = k) = p(w_2 = i \mid z_2 = k)$. Given infinite data, the elements of the Q matrix can
be interpreted as $Q_{i,j} = p(w_1 = i, w_2 = j)$. The row-normalized Q matrix, denoted $\bar{Q}$, which plays
a role in both finding the anchor words and the recovery step, can be interpreted as a conditional
probability $\bar{Q}_{i,j} = p(w_2 = j \mid w_1 = i)$.

Denoting the indices of the anchor words as $S = \{s_1, s_2, \ldots, s_K\}$, the rows indexed by elements
Algorithm 3. RecoverKL

Input: Matrix Q, Set of anchor words S, tolerance parameter $\epsilon$.
Output: Matrices A, R

  Normalize the rows of Q to form $\bar{Q}$.
  Store the normalization constants $p_w = Q \mathbf{1}$.
  $\bar{Q}_{s_k}$ is the row of $\bar{Q}$ for the k-th anchor word.
  for i = 1, ..., V do
    Solve $C_{i\cdot} = \arg\min_{C_i} D_{KL}\big(\bar{Q}_i \,\|\, \sum_{k \in S} C_{i,k} \bar{Q}_{s_k}\big)$
    Subject to: $\sum_k C_{i,k} = 1$ and $C_{i,k} \ge 0$
    With tolerance: $\epsilon$
  end for
  $A' = \mathrm{diag}(p_w)\, C$
  Normalize the columns of $A'$ to form A.
  $R = A^{\dagger} Q A^{\dagger T}$
  return A, R



of S are special in that every other row of $\bar{Q}$ lies in the convex hull of the rows indexed by the
anchor words. To see this, first note that for an anchor word $s_k$,

$$\bar{Q}_{s_k, j} = \sum_{k'} p(z_1 = k' \mid w_1 = s_k)\, p(w_2 = j \mid z_1 = k') \qquad (1)$$

$$\phantom{\bar{Q}_{s_k, j}} = p(w_2 = j \mid z_1 = k), \qquad (2)$$

where (1) uses the fact that in an admixture model $w_2 \perp w_1 \mid z_1$, and (2) is because
$p(z_1 = k \mid w_1 = s_k) = 1$. For any other word i, we have

$$\bar{Q}_{i,j} = \sum_{k} p(z_1 = k \mid w_1 = i)\, p(w_2 = j \mid z_1 = k).$$

Denoting the probability $p(z_1 = k \mid w_1 = i)$ as $C_{i,k}$, we have $\bar{Q}_{i,j} = \sum_k C_{i,k} \bar{Q}_{s_k, j}$. Since C is
non-negative and $\sum_k C_{i,k} = 1$, we have that any row of $\bar{Q}$ lies in the convex hull of the rows
corresponding to the anchor words. The mixing weights give us $p(z_1 \mid w_1 = i)$. Using this together with
$p(w_1 = i)$, we can recover the A matrix simply by using Bayes' rule:

$$p(w_1 = i \mid z_1 = k) = \frac{p(z_1 = k \mid w_1 = i)\, p(w_1 = i)}{\sum_{i'} p(z_1 = k \mid w_1 = i')\, p(w_1 = i')}.$$

Finally, we observe that $p(w_1 = i)$ is easy to solve for since $\sum_j Q_{i,j} = \sum_j p(w_1 = i, w_2 = j) =
p(w_1 = i)$.
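The convex-hull property and the Bayes' rule step can be checked numerically on a small exact ("infinite data") example. In the sketch below the matrices A and R, the anchor indices, and the helper names are all hypothetical, and the mixing weights C are computed from the ground truth purely to verify the identities; the recovery algorithm itself must estimate C from the row-normalized co-occurrence matrix.

```python
def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

# Hypothetical separable model: words 0 and 3 are anchors for topics 0 and 1.
A = [[0.4, 0.0], [0.3, 0.2], [0.3, 0.3], [0.0, 0.5]]
R = [[0.35, 0.15], [0.15, 0.35]]          # E[W W^T]; entries sum to 1
V, K, anchors = 4, 2, [0, 3]

Q = matmul(matmul(A, R), transpose(A))     # Q = A R A^T
p_w = [sum(row) for row in Q]              # p(w1 = i) = sum_j Q[i][j]
Q_bar = [[q / p for q in row] for row, p in zip(Q, p_w)]

# Ground-truth mixing weights C[i][k] = p(z1 = k | w1 = i), for checking only.
p_z = [sum(row) for row in R]              # p(z1 = k)
C = [[A[i][k] * p_z[k] / p_w[i] for k in range(K)] for i in range(V)]

# Every row of Q_bar is a convex combination of the anchor rows.
recon = [[sum(C[i][k] * Q_bar[anchors[k]][j] for k in range(K)) for j in range(V)]
         for i in range(V)]

# Bayes' rule: A[i][k] is proportional to C[i][k] * p(w1 = i), normalized over words.
unnorm = [[C[i][k] * p_w[i] for k in range(K)] for i in range(V)]
col_sums = [sum(unnorm[i][k] for i in range(V)) for k in range(K)]
A_rec = [[unnorm[i][k] / col_sums[k] for k in range(K)] for i in range(V)]
```

On exact data `recon` matches `Q_bar` and `A_rec` matches `A` up to floating-point error.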

Our new algorithm finds, for each row of the empirical row normalized co-occurrence matrix, $\bar{Q}_i$,
the coefficients $p(z_1 \mid w_1 = i)$ that best reconstruct it as a convex combination of the rows that
correspond to anchor words. This step can be solved quickly and in parallel (independently) for each
word using the exponentiated gradient algorithm. Once we have $p(z_1 \mid w_1)$, we recover the A matrix
using Bayes' rule. The full algorithm using KL divergence as an objective is found in Algorithm 3. 
Further details of the exponentiated gradient algorithm are given in the supplementary material. 
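As a rough illustration of the per-word optimization, the following sketch runs a basic exponentiated gradient update on the KL objective for a single row. It is not the procedure from the supplementary material: the fixed step size, the uniform initialization, the stopping rule, and all function names are assumptions chosen for simplicity.

```python
import math

def mixture(c, anchor_rows):
    """Convex combination sum_k c[k] * anchor_rows[k]."""
    return [sum(ck * row[j] for ck, row in zip(c, anchor_rows))
            for j in range(len(anchor_rows[0]))]

def recover_row_kl(q, anchor_rows, step=1.0, max_iters=500, tol=1e-12):
    """Minimize KL(q || sum_k c_k x_k) over the simplex by exponentiated gradient.

    Assumes the mixture is positive wherever q is positive.
    """
    K = len(anchor_rows)
    c = [1.0 / K] * K                                   # start at the simplex center
    for _ in range(max_iters):
        m = mixture(c, anchor_rows)
        # d/dc_k KL(q || m) = -sum_j q_j * x_{k,j} / m_j
        grad = [-sum(qj * xkj / mj for qj, xkj, mj in zip(q, row, m) if qj > 0)
                for row in anchor_rows]
        shift = min(grad)                               # keeps exponents <= 0
        w = [ck * math.exp(-step * (g - shift)) for ck, g in zip(c, grad)]
        total = sum(w)
        new_c = [wk / total for wk in w]                # renormalize onto the simplex
        converged = max(abs(a - b) for a, b in zip(new_c, c)) < tol
        c = new_c
        if converged:
            break
    return c

# Hypothetical anchor rows and a target row that is an exact mixture of them.
anchor_rows = [[0.6, 0.3, 0.1, 0.0], [0.0, 0.1, 0.3, 0.6]]
q = mixture([0.3, 0.7], anchor_rows)
c_hat = recover_row_kl(q, anchor_rows)
```

The multiplicative update keeps the weights non-negative and the renormalization keeps them summing to one, so no projection step is needed. Each word's problem is independent of the others, which is what allows the parallel solve described above.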

One reason to use KL divergence as the measure of reconstruction error is that the recovery
procedure can then be understood as maximum likelihood estimation. In particular, we seek the
parameters $p(w_1)$, $p(z_1 \mid w_1)$, $p(w_2 \mid z_1)$ that maximize the likelihood of observing the word co-occurrence
counts, Q. However, the optimization problem does not explicitly constrain the parameters to
correspond to an admixture model.

We can also define a similar algorithm using quadratic loss, which we call RecoverL2. This
formulation has the extremely useful property that both the objective and gradient can be kernelized
so that the optimization problem is independent of the vocabulary size. To see this, notice that
the objective can be re-written as

$$\|\bar{Q}_i - C_i^T \bar{Q}_S\|^2 = \|\bar{Q}_i\|^2 - 2\, C_i^T (\bar{Q}_S \bar{Q}_i^T) + C_i^T (\bar{Q}_S \bar{Q}_S^T)\, C_i,$$

where $\bar{Q}_S \bar{Q}_S^T$ is K x K and can be computed once and used for all words, and $\bar{Q}_S \bar{Q}_i^T$ is K x 1 and
can be computed once prior to running the exponentiated gradient algorithm for word i.
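The kernelized form can be verified on a toy example. In this sketch the function names and the numbers are hypothetical; the point is only that once the K x K and K x 1 inner products are cached, neither the objective nor its gradient touches a length-V vector again.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_direct(q, c, anchor_rows):
    """||q - c^T Q_S||^2 computed in the full V-dimensional space."""
    m = [sum(ck * row[j] for ck, row in zip(c, anchor_rows)) for j in range(len(q))]
    return sum((qj - mj) ** 2 for qj, mj in zip(q, m))

def l2_kernelized(qq, Sq, SS, c):
    """Same objective from cached inner products: qq - 2 c.Sq + c^T SS c."""
    K = len(c)
    quad = sum(c[k] * SS[k][l] * c[l] for k in range(K) for l in range(K))
    return qq - 2.0 * dot(c, Sq) + quad

def l2_gradient(Sq, SS, c):
    """Gradient of the kernelized objective: 2 (SS c - Sq)."""
    return [2.0 * (dot(SS[k], c) - Sq[k]) for k in range(len(c))]

# Hypothetical example with V = 3 and K = 2.
anchor_rows = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
q = [0.2, 0.3, 0.5]
c = [0.4, 0.6]

SS = [[dot(r1, r2) for r2 in anchor_rows] for r1 in anchor_rows]  # K x K, shared by all words
Sq = [dot(r, q) for r in anchor_rows]                             # K x 1, once per word
qq = dot(q, q)                                                    # scalar, once per word
```

After the one-time cost of forming `SS` and `Sq`, each evaluation costs O(K^2) regardless of the vocabulary size.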

To recover the R matrix for an admixture model, recall that $Q = A R A^T$. This may be an
over-constrained system of equations with no solution for R, but we can find a least-squares
approximation to R by pre- and post-multiplying Q by the pseudo-inverse $A^{\dagger}$. For the special case of LDA we
can learn the Dirichlet hyperparameters. Recall that in applying Bayes' rule we calculated
$p(z_1) = \sum_{i'} p(z_1 \mid w_1 = i')\, p(w_1 = i')$. These values for $p(z)$ specify the Dirichlet hyperparameters up to a
constant scaling. This constant could be recovered from the R matrix [Arora et al., 2012b], but in
practice we find it is better to choose it using a grid search to maximize the likelihood of the training data.

We will see in Section 5 that our nonnegative recovery algorithm performs much better on a 
wide range of performance metrics than the recovery algorithm in Arora et al. [2012b]. In the sup- 
plementary material we show that it also inherits the theoretical guarantees of Arora et al. [2012b]: 
given polynomially many documents, our algorithm returns an estimate $\hat{A}$ at most $\epsilon$ from the true
word-topic matrix A.

4 A Combinatorial Algorithm for Finding Anchor Words 

Here we consider the anchor selection step of the algorithm where our goal is to find the anchor 
words. In the infinite data case where we have infinitely many documents, the convex hull of the
rows in $\bar{Q}$ will be a simplex where the vertices of this simplex correspond to the anchor words.
Since we only have a finite number of documents, the rows of $\bar{Q}$ are only an approximation to their
expectation. We are therefore given a set of V points $d_1, d_2, \ldots, d_V$ that are each a perturbation of
$a_1, a_2, \ldots, a_V$ whose convex hull P defines a simplex. We would like to find an approximation to the
vertices of P. See Arora et al. [2012a] and Arora et al. [2012b] for more details about this problem.

Arora et al. [2012a] give a polynomial time algorithm that finds the anchor words. However, 
their algorithm is based on solving V linear programs, one for each word, to test whether or not a 
point is a vertex of the convex hull. In this section we describe a purely combinatorial algorithm 
for this task that avoids linear programming altogether. The new "FastAnchorWords" algorithm is 
given in Algorithm 4. To find all of the anchor words, our algorithm iteratively finds the furthest 
point from the subspace spanned by the anchor words found so far. 

Since the points we are given are perturbations of the true points, we cannot hope to find the
anchor words exactly. Nevertheless, the intuition is that even if one has only found r points S that
are close to r (distinct) anchor words, the point that is furthest from span(S) will itself be close
to a (new) anchor word. The additional advantage of this procedure is that when faced with many
choices for a next anchor word to find, our algorithm tends to find the one that is most different
from the ones we have found so far.
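The greedy "furthest point from the span" idea can be sketched with a Gram-Schmidt style update. This is a simplified illustration only: it omits the random projection and the cleanup phase of FastAnchorWords, assumes the points contain at least K linearly independent directions, and uses hypothetical names and data.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def greedy_anchors(points, K):
    """Repeatedly pick the point farthest from the span of the points chosen so far.

    residuals[i] holds the component of points[i] orthogonal to that span, so its
    norm is the distance from points[i] to the span.
    """
    residuals = [list(p) for p in points]   # span of the empty set is the origin
    chosen = []
    for _ in range(K):
        j = max(range(len(points)), key=lambda i: norm(residuals[i]))
        chosen.append(j)
        r = norm(residuals[j])              # assumed non-zero (independent directions)
        basis = [x / r for x in residuals[j]]
        for i, res in enumerate(residuals):
            coef = sum(a * b for a, b in zip(res, basis))
            residuals[i] = [a - coef * b for a, b in zip(res, basis)]
    return chosen

# Hypothetical points: indices 1, 3 and 4 are the simplex vertices, the rest are
# convex combinations of them.
points = [[0.5, 0.3, 0.2],
          [1.0, 0.0, 0.0],
          [0.2, 0.2, 0.6],
          [0.0, 1.0, 0.0],
          [0.0, 0.0, 1.0],
          [0.1, 0.8, 0.1]]
chosen = greedy_anchors(points, K=3)
```

Keeping the residuals up to date means each round costs one pass over the points, rather than a fresh projection onto the whole span.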

The main contribution of this section is a proof that the FastAnchorWords algorithm succeeds 
in finding K points that are close to anchor words. To precisely state the guarantees, we recall the 
following definition from [Arora et al., 2012a]: 
Algorithm 4. FastAnchorWords

Input: V points $\{d_1, d_2, \ldots, d_V\}$ in V dimensions, almost in a simplex with K vertices and $\epsilon > 0$
Output: K points that are close to the vertices of the simplex.

  Project the points $d_i$ to a randomly chosen $4 \log V / \epsilon^2$ dimensional subspace
  $S \leftarrow \{d_i\}$ s.t. $d_i$ is the farthest point from the origin.
  for i = 1 TO K - 1 do
    Let $d_j$ be the point in $\{d_1, \ldots, d_V\}$ that has the largest distance to span(S).
    $S \leftarrow S \cup \{d_j\}$.
  end for
  $S = \{v'_1, v'_2, \ldots, v'_K\}$.
  for i = 1 TO K do
    Let $d_j$ be the point that has the largest distance to span($\{v'_1, v'_2, \ldots, v'_K\} \setminus \{v'_i\}$)
    Update $v'_i$ to $d_j$
  end for
  Return $\{v'_1, v'_2, \ldots, v'_K\}$.

Notation: span(S) denotes the subspace spanned by the points in the set S. We compute the distance
from a point x to the subspace span(S) by computing the norm of the projection of x onto the orthogonal
complement of span(S).



Definition 4.1. A simplex P is $\gamma$-robust if for every vertex v of P, the $\ell_2$ distance between v and
the convex hull of the rest of the vertices is at least $\gamma$.

In most reasonable settings the parameters of the topic model define lower bounds on the robustness 
of the polytope P. For example, in LDA, this lower bound is based on the largest ratio of any pair 
of hyper-parameters in the model [Arora et al., 2012b]. Our goal is to find a set of points that are 
close to the vertices of the simplex, and to make this precise we introduce the following definition: 

Definition 4.2. Let $a_1, a_2, \ldots, a_V$ be a set of points whose convex hull P is a simplex with vertices
$v_1, v_2, \ldots, v_K$. Then we say $a_i$ $\epsilon$-covers $v_j$ if when $a_i$ is written as a convex combination of the vertices
as $a_i = \sum_j c_j v_j$, then $c_j \ge 1 - \epsilon$. Furthermore we will say that a set of K points $\epsilon$-covers the vertices
if each vertex is $\epsilon$-covered by some point in the set.

We will prove the following theorem: suppose there is a set of points $A = a_1, a_2, \ldots, a_V$ whose
convex hull P is $\gamma$-robust and has vertices $v_1, v_2, \ldots, v_K$ (which appear in A) and that we are given
a perturbation $d_1, d_2, \ldots, d_V$ of the points so that for each i, $\|a_i - d_i\| \le \epsilon$, then:

Theorem 4.3. There is a combinatorial algorithm that runs in time $O(V^2 + VK/\epsilon^2)$ (footnote 1) and outputs
a subset of $\{d_1, \ldots, d_V\}$ of size K that $O(\epsilon/\gamma)$-covers the vertices provided that $20K\epsilon/\gamma^2 < \gamma$.

This new algorithm not only helps us avoid linear programming altogether in inferring the 
parameters of a topic model, but also can be used to solve the nonnegative matrix factorization 
problem under the separability assumption, again without resorting to linear programming. Our 
analysis rests on the following lemmas, whose proofs we defer to the supplementary material.
Suppose the algorithm has found a set S of k points that are each $\delta$-close to distinct vertices in
$\{v_1, v_2, \ldots, v_K\}$ and that $\delta < \gamma/20K$.

Lemma 4.4. There is a vertex $v_i$ whose distance from span(S) is at least $\gamma/2$.

1 In practice we find setting dimension to 1000 works well. The running time is then $O(V^2 + 1000VK)$.

The proof of this lemma is based on a volume argument, and the connection between the volume
of a simplex and the determinant of the matrix of distances between its vertices. 

Lemma 4.5. The point $d_j$ found by the algorithm must be $\delta = O(\epsilon/\gamma^2)$ close to some vertex $v_i$.

This lemma is used to show that the error does not accumulate too badly in our algorithm,
since $\delta$ only depends on $\epsilon$ and $\gamma$ (not on the $\delta$ used in the previous step of the algorithm). This
prevents the error from accumulating exponentially in the dimension of the problem, which would be
catastrophic for our proof.

After running the first phase of our algorithm, we run a cleanup phase (the second loop in Alg. 4)
that can reduce the error in our algorithm. When we have K - 1 points close to K - 1 vertices, only
one of the vertices can be far from their span. The farthest point must be close to this missing vertex.
The following lemma shows that this cleanup phase can improve the guarantees of Lemma A.2:

Lemma 4.6. Suppose $|S| = K - 1$ and each point in S is $\delta = O(\epsilon/\gamma^2) < \gamma/20K$ close to some vertex
$v_i$, then the farthest point $v'_j$ found by the algorithm is $1 - O(\delta/\gamma)$ close to the remaining vertex.

This algorithm is a greedy approach to maximizing the volume of the simplex. The larger the 
volume is, the more words per document the resulting model can explain. Better anchor word se- 
lection is an open question for future work. We have experimented with a variety of other heuristics 
for maximizing simplex volume, with varying degrees of success. 

Related work. The separability assumption has also been studied under the name "pure 
pixel assumption" in the context of hyperspectral unmixing. A number of algorithms have been 
proposed that overlap with ours - such as the VCA [Nascimento & Dias, 2004] algorithm (which 
differs in that there is no clean-up phase) and the N-FINDR [Gomez et al., 2007] algorithm which 
attempts to greedily maximize the volume of a simplex whose vertices are data points. However,
these algorithms have only been proven to work in the infinite data case, and for our algorithm we
are able to give provable guarantees even when the data points are perturbed (e.g., as the result
of sampling noise). Recent work of Thurau et al. [2010] and Kumar et al. [2012] follows the same
pattern as our paper, but uses non-negative matrix factorization under the separability assumption.
While both give applications to topic modeling, in realistic applications the term-by-document 
matrix is too sparse to be considered a good approximation to its expectation (because documents 
are short). In contrast, our algorithm works with the Gram matrix Q so that we can give provable 
guarantees even when each document is short. 

5 Experimental Results 

We compare three parameter recovery methods, Recover [Arora et al., 2012b], RecoverKL and
RecoverL2, to a fast implementation of Gibbs sampling [McCallum, 2002] (footnote 2). Linear
programming-based anchor word finding is too slow to be comparable, so we use FastAnchorWords for all three
recovery algorithms. Using Gibbs sampling we obtain the word-topic distributions by averaging
over 10 saved states, each separated by 100 iterations, after 1000 burn-in iterations.

5.1 Methodology 

We train models on two synthetic data sets to evaluate performance when model assumptions
are correct, and real documents to evaluate real-world performance. To ensure that synthetic
documents resemble the dimensionality and sparsity characteristics of real data, we generate
semi-synthetic corpora. For each real corpus, we train a model using MCMC and then generate new
documents using the parameters of that model (these parameters are not guaranteed to be separable).

2 We were not able to obtain Anandkumar et al. [2012]'s implementation of their algorithm, and our own
implementation is too slow to be practical.

Figure 1: Training time on synthetic NIPS documents.

We use two real-world data sets, a large corpus of New York Times articles (295k documents, 
vocabulary size 15k, mean document length 298) and a small corpus of NIPS abstracts (1100 
documents, vocabulary size 2500, mean length 68). Vocabularies were pruned with document 
frequency cutoffs. We generate semi-synthetic corpora of various sizes from models trained with 
K = 100 from NY Times and NIPS, with document lengths set to 300 and 70, respectively, and 
with document-topic distributions drawn from a Dirichlet with symmetric hyperparameters 0.03. 

We use a variety of metrics to evaluate models. For the semi-synthetic corpora, we can
compute reconstruction error between the true word-topic matrix A and learned topic distributions.
Given a learned matrix $\hat{A}$ and the true matrix A, we use an LP to find the best matching
between topics. Once topics are aligned, we evaluate $\ell_1$ distance between each pair of topics. When
true parameters are not available, a standard evaluation for topic models is to compute held-out
probability, the probability of previously unseen documents under the learned model. This
computation is intractable but there are reliable approximation methods [Buntine, 2009, Wallach et al.,
2009]. Topic models are useful because they provide interpretable latent dimensions. We can
evaluate the semantic quality of individual topics using a metric called Coherence. Coherence is based
on two functions, $D(w)$ and $D(w_1, w_2)$, which are the number of documents with at least one instance
of $w$, and of $w_1$ and $w_2$, respectively [Mimno et al., 2011]. Given a set of words $\mathcal{W}$, coherence is

$$\mathrm{Coherence}(\mathcal{W}) = \sum_{w_1, w_2 \in \mathcal{W}} \log \frac{D(w_1, w_2) + \epsilon}{D(w_2)}.$$

The parameter $\epsilon = 0.01$ is used to avoid taking the log of zero for words that never co-occur
[Stevens et al., 2012]. This metric has been shown to correlate well with human judgments of topic 
quality. If we perfectly reconstruct topics, all the high-probability words in a topic should co-occur 
frequently, otherwise, the model may be mixing unrelated concepts. Coherence measures the quality 
of individual topics, but does not measure redundancy, so we measure inter-topic similarity. For 
each topic, we gather the set of the N most probable words. We then count how many of those 
words do not appear in any other topic's set of N most probable words. Some overlap is expected 
due to semantic ambiguity, but lower numbers of unique words indicate less useful models. 
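Both topic-quality metrics are simple to compute from document-level word sets. The sketch below assumes the smoothed coherence form of Mimno et al. [2011], a sum of log((D(w1, w2) + eps) / D(w2)) over word pairs; summing over ordered pairs of distinct words is an assumption of this example, as are the function names and the toy data. It also assumes every scored word occurs in at least one document.

```python
import math
from itertools import permutations

def coherence(words, documents, eps=0.01):
    """Sum of log((D(w1, w2) + eps) / D(w2)) over ordered pairs of distinct words."""
    doc_sets = [set(d) for d in documents]

    def doc_count(*ws):
        # Number of documents containing every word in ws at least once.
        return sum(1 for d in doc_sets if all(w in d for w in ws))

    return sum(math.log((doc_count(w1, w2) + eps) / doc_count(w2))
               for w1, w2 in permutations(words, 2))

def unique_words(top_words_per_topic):
    """For each topic, count its top-N words that appear in no other topic's top-N set."""
    counts = []
    for k, words in enumerate(top_words_per_topic):
        others = set().union(*(w for j, w in enumerate(top_words_per_topic) if j != k))
        counts.append(sum(1 for w in words if w not in others))
    return counts

# Hypothetical toy corpus and topics.
documents = [["a", "b"], ["a", "b", "c"], ["c"]]
score = coherence(["a", "b"], documents)
uniq = unique_words([["a", "b"], ["b", "c"]])
```

Higher (less negative) coherence means the top words of a topic tend to co-occur, while a low unique-word count flags redundant topics.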

5.2 Efficiency 

The Recover algorithms, in Python, are faster than a heavily optimized Java Gibbs sampling
implementation [Yao et al., 2009]. Fig. 1 shows the time to train models on synthetic corpora on a
single machine. Gibbs sampling is linear in the corpus size. RecoverL2 is also linear ($\rho = 0.79$), but
only varies from 33 to 50 seconds. Estimating Q is linear, but takes only 7 seconds for the largest
corpus. FastAnchorWords takes less than 6 seconds for all corpora.

Figure 2: $\ell_1$ error for a semi-synthetic model generated from a model trained on NY Times articles
with K = 100. The horizontal line indicates the $\ell_1$ error of K uniform distributions.

Figure 3: $\ell_1$ error for a semi-synthetic model generated from a model trained on NIPS papers with
K = 100. Recover fails for D = 2000.

5.3 Semi-synthetic documents 

The new algorithms have good $\ell_1$ reconstruction error on semi-synthetic documents, especially for
larger corpora. Results for semi-synthetic corpora drawn from topics trained on NY Times articles
are shown in Fig. 2 for corpus sizes ranging from 50k to 2M synthetic documents. In addition, we
report results for the three Recover algorithms on "infinite data," that is, the true Q matrix from
the model used to generate the documents. Error bars show variation between topics. Recover
performs poorly in all but the noiseless, infinite data setting. Gibbs sampling has lower $\ell_1$ with
smaller corpora, while the new algorithms get better recovery and lower variance with more data
(although more sampling might reduce MCMC error further).

Results for semi-synthetic corpora drawn from NIPS topics are shown in Fig. 3. Recover does
poorly for the smallest corpora (topic matching fails for D = 2000, so $\ell_1$ is not meaningful), but
achieves moderate error for D comparable to the NY Times corpus. RecoverKL and RecoverL2
also do poorly for the smallest corpora, but are comparable to or better than Gibbs sampling, with
much lower variance, after 40,000 documents.

5.4 Effect of separability 

The non-negative algorithms are more robust to violations of the separability assumption than the 
original Recover algorithm. In Fig. 3, Recover does not achieve zero t\ error even with noiseless 
"infinite" data. Here we show that this is due to lack of separability. In our semi-synthetic corpora, 
documents are generated from the LDA model, but the topic-word distributions are learned from 
data and may not satisfy the anchor words assumption. We test the sensitivity of algorithms to 
violations of the separability condition by adding a synthetic anchor word to each topic that is by 
construction unique to the topic. We assign the synthetic anchor word a probability equal to the 
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SynthNYT+Anchors, L1 error 



Recover 

Recoverl_2 

RacoverKL 



Figure 4: When we add artificial anchor words before generating synthetic documents, l\ error 
goes to zero for Recover and close to zero for RecoverKL and RecoverL2. 

SynthNYT, strong corr, L1 error SynthNYT, moderate corr, L1 error 

iiiiHK 11 - uuuiki. 1 

50000 100000 150000 200000 250000 300000 400000 500000 1000000 2000000 Infinite 50000 100000 150000 200000 250000 300000 400000 500000 1000000 2000000 Infinite 

Documents Documents 

Figure 5: i\ error increases as we increase topic correlation. We use the same K = 100 topic model 
from NY Times articles, but add correlation: TOP p = 0.05, BOTTOM p = 0.1. 



most probable word in the original topic. This causes the distribution to sum to greater than 1.0,
so we renormalize. Results are shown in Fig. 4. The $\ell_1$ error goes to zero for Recover, and close
to zero for RecoverKL and RecoverL2. RecoverKL and RecoverL2 do not reach exactly
zero because we do not solve the optimization problems to perfect optimality.
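The anchor-word augmentation described in this section can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `add_synthetic_anchors` and the choice to append the K new words at the end of the vocabulary are assumptions made here.

```python
def add_synthetic_anchors(topics):
    """Give each topic one new word that no other topic uses, then renormalize.

    `topics` is a list of K topic-word distributions (lists of V probabilities).
    The result has K distributions over V + K words; appending the new words
    at the end of the vocabulary is an assumption of this sketch.
    """
    num_topics = len(topics)
    augmented = []
    for k, topic in enumerate(topics):
        # The synthetic anchor gets the probability of the most probable word.
        anchor_prob = max(topic)
        anchors = [0.0] * num_topics
        anchors[k] = anchor_prob  # zero in every other topic by construction
        extended = list(topic) + anchors
        total = sum(extended)  # equals 1 + anchor_prob, so renormalize
        augmented.append([p / total for p in extended])
    return augmented


topics = [[0.5, 0.3, 0.2], [0.1, 0.1, 0.8]]
augmented = add_synthetic_anchors(topics)
```

Each augmented topic still sums to one, and its anchor word has zero probability in every other topic, which is what the separability assumption requires.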

5.5 Effect of correlation 

The theoretical guarantees of the new algorithms apply even if topics are correlated. To test how
algorithms respond to correlation, we generated new synthetic corpora from the same K = 100 model
trained on NY Times articles. Instead of a symmetric Dirichlet distribution, we use a logistic
normal distribution with a block-structured covariance matrix. We partition topics into 10 groups.
For each pair of topics in a group, we add a non-zero off-diagonal element to the covariance matrix.
This block structure is not necessarily realistic, but shows the effect of correlation. Results for two
levels of covariance ($\rho = 0.05$, $\rho = 0.1$) are shown in Fig. 5. Results for Recover are much worse in
both cases than for the Dirichlet-generated corpora in Fig. 2. The other three algorithms, especially
Gibbs sampling, are more robust to correlation, but performance consistently degrades as correlation
increases, and improves with larger corpora. With infinite data, $\ell_1$ error is equal to the $\ell_1$ error in
the uncorrelated synthetic corpus (non-zero because of violations of the separability assumption).
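One way to draw document-topic proportions from a logistic normal with this kind of block structure is sketched below. The text does not state the mean or the diagonal of the covariance matrix, so zero mean and unit variance are assumptions of this sketch, as are the function name and the use of contiguous topic groups. Within a group, a shared Gaussian factor produces covariance `rho` between every pair of topics.

```python
import math
import random


def sample_topic_proportions(num_topics, num_groups, rho, rng):
    """Draw one topic-proportion vector from a block-correlated logistic normal.

    Assumptions of this sketch: zero mean, unit variance on the diagonal,
    covariance `rho` between topics of the same (contiguous) group, 0 otherwise.
    """
    group_size = max(1, num_topics // num_groups)
    # One shared standard normal per group induces the within-group covariance.
    shared = [rng.gauss(0.0, 1.0) for _ in range(num_groups)]
    logits = []
    for k in range(num_topics):
        group = min(k // group_size, num_groups - 1)
        own = rng.gauss(0.0, 1.0)
        logits.append(math.sqrt(rho) * shared[group] + math.sqrt(1.0 - rho) * own)
    # Softmax maps the Gaussian vector onto the probability simplex.
    top = max(logits)
    weights = [math.exp(value - top) for value in logits]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(0)
theta = sample_topic_proportions(100, 10, 0.1, rng)
```

With `rho = 0` the coordinates are independent, which corresponds to the uncorrelated case.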

5.6 Real documents 

The new algorithms produce comparable quantitative and qualitative results on real data. Fig. 6 
shows three metrics for both corpora. Error bars show the distribution of log probabilities across 
held-out documents (top panel) and coherence and unique words across topics (center and bottom 
panels). Held-out sets are 230 documents for NIPS and 59k for NY Times. For the small NIPS 
corpus we average over 5 non-overlapping train/test splits. The matrix inversion in Recover failed
for the smaller corpus (NIPS). In the larger corpus (NY Times), Recover produces noticeably worse 
held-out log probability per token than the other algorithms. Gibbs sampling produces the best 
[Figure 6 plots: x-axis Topics (50, 100, 150, 300), one set of panels per corpus]

Figure 6: Held-out probability (per token) is similar for RecoverKL, RecoverL2, and Gibbs
sampling. RecoverKL and RecoverL2 have better coherence, but fewer unique words than Gibbs.
(Up is better for all three metrics.)

Table 1: Example topic pairs from NY Times (closest $\ell_1$), anchor words in bold. All 100 topics
are in suppl. material.

RecoverL2   run inning game hit season zzz_anaheim_angel
Gibbs       run inning hit game ball pitch
RecoverL2   father family zzz_elian boy court zzz_miami
Gibbs       zzz_cuba zzz_miami cuban zzz_elian boy protest
RecoverL2   file sport read internet email zzz_los_angeles
Gibbs       web site com www mail zzz_internet


average held-out probability (p < 0.0001 under a paired t-test), but the difference is within the
range of variability between documents. We tried several methods for estimating hyperparameters,
but the observed differences did not change the relative performance of algorithms. Gibbs sampling
has worse coherence than the Recover algorithms, but produces more unique words per topic.
These patterns are consistent with semi-synthetic results for similarly sized corpora (details are in
supplementary material).

For each NY Times topic learned by RecoverL2 we find the closest Gibbs topic by $\ell_1$ distance.
The closest, median, and farthest topic pairs are shown in Table 1 (see footnote 3). We observe that when there
is a difference, Recover-based topics tend to have more specific words (Anaheim Angels vs. pitch).
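The topic matching step can be sketched as below. The function names are illustrative, and topics are assumed to be plain lists of word probabilities over a shared vocabulary.

```python
def l1_distance(p, q):
    """Sum of absolute differences between two distributions of equal length."""
    return sum(abs(a - b) for a, b in zip(p, q))


def closest_topics(recover_topics, gibbs_topics):
    """For each Recover topic, return (index of closest Gibbs topic, distance)."""
    matches = []
    for topic in recover_topics:
        distances = [l1_distance(topic, other) for other in gibbs_topics]
        best = min(range(len(gibbs_topics)), key=distances.__getitem__)
        matches.append((best, distances[best]))
    return matches


recover = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
gibbs = [[0.1, 0.2, 0.7], [0.6, 0.3, 0.1]]
matches = closest_topics(recover, gibbs)
```

Sorting the resulting pairs by distance then gives the closest, median, and farthest pairs.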

6 Conclusions 

We present new algorithms for topic modeling, inspired by Arora et al. [2012b], which are efficient
and simple to implement yet maintain provable guarantees. The running time of these algorithms is
effectively independent of the size of the corpus. Empirical results suggest that the sample complexity
of these algorithms is somewhat greater than MCMC, but, particularly for the $\ell_2$ variant, they
provide comparable results in a fraction of the time. We have tried to use the output of our algorithms
as initialization for further optimization (e.g. using MCMC) but have not yet found a hybrid
that out-performs either method by itself. Finally, although we defer parallel implementations to
future work, these algorithms are parallelizable, potentially supporting web-scale topic inference.

3 The UCI NY Times corpus includes named-entity annotations, indicated by the zzz prefix. 
A Proof for Anchor-Words Finding Algorithm

Recall that the correctness of the algorithm depends on the following Lemmas:

Lemma A.1. There is a vertex $v_i$ whose distance from span($S$) is at least $\gamma/2$.

Lemma A.2. The point $d_j$ found by the algorithm must be $\delta = O(\epsilon/\gamma^2)$ close to some vertex $v_i$.

In order to prove Lemma A.1, we use a volume argument. First we show that the volume of a
robust simplex cannot change by too much when the vertices are perturbed.

Lemma A.3. Suppose $\{v_1, v_2, \ldots, v_K\}$ are the vertices of a $\gamma$-robust simplex $S$. Let $S'$ be a simplex
with vertices $\{v'_1, v'_2, \ldots, v'_K\}$, each of the vertices $v'_i$ is a perturbation of $v_i$ and $\|v'_i - v_i\|_2 \le \delta$. When
$10\sqrt{K}\delta < \gamma$ the volume of the two simplices satisfy

$$\mathrm{vol}(S)(1 - 2\delta/\gamma)^{K-1} \le \mathrm{vol}(S') \le \mathrm{vol}(S)(1 + 4\delta/\gamma)^{K-1}.$$

Proof: As the volume of a simplex is proportional to the determinant of a matrix whose columns
are the edges of the simplex, we first show the following perturbation bound for determinant.

Claim A.4. Let $A, E$ be $K \times K$ matrices, the smallest eigenvalue of $A$ is at least $\gamma$, the Frobenius
norm $\|E\|_F \le \sqrt{K}\delta$, when $\gamma > 5\sqrt{K}\delta$ we have

$$\det(A + E)/\det(A) \ge (1 - \delta/\gamma)^K.$$

Proof: Since $\det(AB) = \det(A)\det(B)$, we can multiply both $A$ and $A + E$ by $A^{-1}$. Hence
$\det(A + E)/\det(A) = \det(I + A^{-1}E)$.

The Frobenius norm of $A^{-1}E$ is bounded by

$$\|A^{-1}E\|_F \le \|A^{-1}\|_2 \|E\|_F \le \sqrt{K}\delta/\gamma.$$

Let the eigenvalues of $A^{-1}E$ be $\lambda_1, \lambda_2, \ldots, \lambda_K$, then by definition of Frobenius norm
$\sum_{i=1}^K \lambda_i^2 \le \|A^{-1}E\|_F^2 \le K\delta^2/\gamma^2$. The eigenvalues of $I + A^{-1}E$ are just
$1 + \lambda_1, 1 + \lambda_2, \ldots, 1 + \lambda_K$, and the determinant $\det(I + A^{-1}E) = \prod_{i=1}^K (1 + \lambda_i)$.
Hence it suffices to show

$$\min \prod_{i=1}^K (1 + \lambda_i) \ge (1 - \delta/\gamma)^K \quad \text{when} \quad \sum_{i=1}^K \lambda_i^2 \le K\delta^2/\gamma^2.$$

To do this we apply the Lagrangian method and show the minimum is only obtained when all $\lambda_i$'s
are equal. The optimal value must be obtained at a local optimum of

$$\prod_{i=1}^K (1 + \lambda_i) - C \sum_{i=1}^K \lambda_i^2.$$

Taking partial derivatives with respect to the $\lambda_i$'s, we get the equations
$-\lambda_i(1 + \lambda_i) = -\prod_{j=1}^K (1 + \lambda_j)/2C$ (here using that $\sqrt{K}\delta/\gamma$ is small so $1 + \lambda_i > 1/2 > 0$).
The right hand side is a constant, so each $\lambda_i$ must be one of the two solutions of this equation.
However, the two solutions sum to $-1$, so only one of them is larger than $-1/2$, therefore all the
$\lambda_i$'s are equal. ■

For the lower bound, we can project the perturbed subspace to the $K - 1$ dimensional space.
Such a projection cannot increase the volume and the perturbation distances only get smaller.
Therefore we can apply the claim directly, the columns of $A$ are just $v_{i+1} - v_1$ for $i = 1, 2, \ldots, K - 1$;
columns of $E$ are just $v'_{i+1} - v_{i+1} - (v'_1 - v_1)$. The smallest eigenvalue of $A$ is at least $\gamma$ because
the polytope is $\gamma$ robust, which is equivalent to saying after orthogonalization each column still has
length at least $\gamma$. The Frobenius norm of $E$ is at most $2\sqrt{K - 1}\,\delta$. We get the lower bound directly
by applying the claim.

For the upper bound, swap the two sets $S$ and $S'$ and use the argument for the lower bound.
The only thing we need to show is that the smallest eigenvalue of the matrix generated by points
in $S'$ is still at least $\gamma/2$. This follows from Wedin's Theorem [Wedin, 1972] and the fact that
$\|E\| \le \|E\|_F \le \sqrt{K}\delta \le \gamma/2$. ■

Now we are ready to prove Lemma A.1.
Proof: The first case is for the first step of the algorithm, when we try to find the farthest point
to the origin. Here essentially $S = \{0\}$. For any two vertices $v_1, v_2$, since the simplex is $\gamma$ robust,
the distance between $v_1$ and $v_2$ is at least $\gamma$. This means $\mathrm{dis}(0, v_1) + \mathrm{dis}(0, v_2) \ge \gamma$, so one of them
must be at least $\gamma/2$.

For the later steps, recall that $S$ contains vertices of a perturbed simplex. Let $S'$ be the set of
original vertices corresponding to the perturbed vertices in $S$. Let $v$ be any vertex in $\{v_1, v_2, \ldots, v_K\}$
which is not in $S$. Now we know the distance between $v$ and $S$ is equal to $\mathrm{vol}(S \cup \{v\})/((|S| - 1)\mathrm{vol}(S))$.
On the other hand, we know $\mathrm{vol}(S' \cup \{v\})/((|S'| - 1)\mathrm{vol}(S')) \ge \gamma$. Using Lemma A.3 to bound the
ratio between the two pairs $\mathrm{vol}(S)/\mathrm{vol}(S')$ and $\mathrm{vol}(S \cup \{v\})/\mathrm{vol}(S' \cup \{v\})$, we get:

$$\mathrm{dis}(v, S) \ge (1 - 4\epsilon'/\gamma)^{2|S| - 2}\,\gamma > \gamma/2$$

when $\gamma > 20K\epsilon'$. ■

Lemma A.2 is based on the following observation: in a simplex the point with largest $\ell_2$ norm is
always a vertex. Even if two vertices have the same norm, if they are not close to each other the
points on the edge connecting them will have significantly lower norm.
Figure 8: Proof of Lemma A.2, after projecting to the orthogonal subspace of span($S$).

Proof: (Lemma A.2)
Since $d_j$ is the point found by the algorithm, let us consider the point $a_j$ before perturbation.
The point $a_j$ is inside the simplex, therefore we can write $a_j$ as a convex combination of the vertices:

$$a_j = \sum_{t=1}^K c_t v_t.$$

Let $v_t$ be the vertex with largest coefficient $c_t$. Let $\Delta$ be the largest distance from some vertex
to the space spanned by points in $S$ ($\Delta = \max_l \mathrm{dis}(v_l, \mathrm{span}(S))$). By Lemma A.1 we know $\Delta \ge \gamma/2$.
Also notice that we are not assuming $\mathrm{dis}(v_t, \mathrm{span}(S)) = \Delta$.

Now we rewrite $a_j$ as $c_t v_t + (1 - c_t)w$, where $w$ is a vector in the convex hull of vertices other
than $v_t$. Observe that $a_j$ must be far from span($S$), because $d_j$ is the farthest point found by the
algorithm. Indeed:

$$\mathrm{dis}(a_j, \mathrm{span}(S)) \ge \mathrm{dis}(d_j, \mathrm{span}(S)) - \epsilon \ge \mathrm{dis}(v_l, \mathrm{span}(S)) - 2\epsilon \ge \Delta - 2\epsilon.$$

The second inequality is because there must be some point $d_l$ that corresponds to the farthest
vertex $v_l$ and has $\mathrm{dis}(d_l, \mathrm{span}(S)) \ge \Delta - \epsilon$. Thus as $d_j$ is the farthest point,
$\mathrm{dis}(d_j, \mathrm{span}(S)) \ge \mathrm{dis}(d_l, \mathrm{span}(S)) \ge \Delta - \epsilon$.

The point $a_j$ is on the segment connecting $v_t$ and $w$, and the distance between $a_j$ and span($S$) is not
much smaller than that of $v_t$ and $w$. Following the intuition in $\ell_2$ norm, when $v_t$ and $w$ are far we
would expect $a_j$ to be very close to either $v_t$ or $w$. Since $c_t \ge 1/K$ it cannot be really close to $w$, so
it must be really close to $v_t$. We formalize this intuition by the following calculation (see Figure 8):

Project everything to the orthogonal subspace of span($S$) (points in span($S$) are now at the
origin). After projection, distance to span($S$) is just the $\ell_2$ norm of a vector. Without loss of
generality we assume $\|v_t\|_2 = \|w\|_2 = \Delta$ because these two have length at most $\Delta$, and extending
these two vectors to have length $\Delta$ can only increase the length of $a_j$.

The point $v_t$ must be far from $w$ by applying Lemma A.1: consider the set of vertices $V' =
\{v_i : v_i$ does not correspond to any point in $S$ and $i \ne t\}$. The set $V' \cup S$ satisfies the assumptions
in Lemma A.1 so there must be one vertex that is far from span($V' \cup S$), and it can only be $v_t$.
Therefore even after projecting to the orthogonal subspace of span($S$), $v_t$ is still far from any convex
combination of $V'$. The vertices that are not in $V'$ all have very small norm after projecting to the
orthogonal subspace (at most $\delta_0$) so we know the distance of $v_t$ and $w$ is at least $\gamma/2 - \delta_0 > \gamma/4$.

Now the problem becomes a two dimensional calculation. When $c_t$ is fixed the length of $a_j$ is
strictly increasing when the distance of $v_t$ and $w$ decreases, so we assume the distance is $\gamma/4$. Simple
calculation (using essentially just the Pythagorean theorem) shows

$$c_t(1 - c_t) \le \frac{\epsilon}{\Delta - \sqrt{\Delta^2 - \gamma^2/16}}.$$

The right hand side is largest when $\Delta = 2$ (since the vectors are in the unit ball) and the maximum
value is $O(\epsilon/\gamma^2)$. When this value is smaller than $1/K$, we must have $1 - c_t \le O(\epsilon/\gamma^2)$. Thus
$c_t \ge 1 - O(\epsilon/\gamma^2)$ and $\delta \le (1 - c_t) + \epsilon \le O(\epsilon/\gamma^2)$. ■
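The two-dimensional calculation in the proof of Lemma A.2 can be spelled out as follows. This is a sketch of one way to reach a bound of the same $O(\epsilon/\gamma^2)$ order, not necessarily the authors' exact computation, and its constants differ from the bound displayed in the proof. It writes the unperturbed point as $a_j = c_t v_t + (1 - c_t)w$ and uses only three facts from the proof: the normalization $\|v_t\|_2 = \|w\|_2 = \Delta$, the assumed distance $\|v_t - w\|_2 = \gamma/4$, and $\|a_j\|_2 \ge \Delta - 2\epsilon \ge 0$ after projection.

```latex
\begin{align*}
\|a_j\|_2^2 &= c_t^2\Delta^2 + (1-c_t)^2\Delta^2 + 2c_t(1-c_t)\langle v_t, w\rangle, \\
\langle v_t, w\rangle &= \Delta^2 - \tfrac{1}{2}\|v_t - w\|_2^2 = \Delta^2 - \gamma^2/32, \\
\|a_j\|_2^2 &= \Delta^2 - c_t(1-c_t)\,\gamma^2/16. \\[4pt]
\Delta^2 - c_t(1-c_t)\,\gamma^2/16 &\ge (\Delta - 2\epsilon)^2
\;\Longrightarrow\;
c_t(1-c_t) \le \frac{16\,(4\epsilon\Delta - 4\epsilon^2)}{\gamma^2}
\le \frac{64\,\epsilon\,\Delta}{\gamma^2}
\le \frac{128\,\epsilon}{\gamma^2}.
\end{align*}
```

The last step uses $\Delta \le 2$, which holds because the vectors lie in the unit ball.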

The cleanup phase tries to find the farthest point to a subset of $K - 1$ vertices, and uses that
point as the $K$-th vertex. This will improve the result because when we have $K - 1$ points close
to K — 1 vertices, only one of the vertices can be far from their span. Therefore the farthest point 
must be close to the only remaining vertex. Another way of viewing this is that the algorithm is 
trying to greedily maximize the volume of the simplex, which makes sense because the larger the 
volume is, the more words/documents the final LDA model can explain. 
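The basic operation used here, finding the point farthest from the span of a set of selected points, can be sketched with Gram-Schmidt orthogonalization. This is an illustrative sketch rather than the authors' implementation: the function names are made up here, and "span" is taken to be the linear span through the origin, which is an assumption suggested by the projection step in the proof of Lemma A.2.

```python
import math


def remove_projection(point, orthonormal_basis):
    """Return the component of `point` orthogonal to every basis vector."""
    residual = list(point)
    for u in orthonormal_basis:
        coeff = sum(r * c for r, c in zip(residual, u))
        residual = [r - coeff * c for r, c in zip(residual, u)]
    return residual


def orthonormalize(vectors, eps=1e-12):
    """Gram-Schmidt: build an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        residual = remove_projection(v, basis)
        norm = math.sqrt(sum(c * c for c in residual))
        if norm > eps:  # skip vectors already (numerically) in the span
            basis.append([c / norm for c in residual])
    return basis


def farthest_from_span(points, selected):
    """Index of the point in `points` with the largest distance to span(selected)."""
    basis = orthonormalize(selected)

    def distance(p):
        return math.sqrt(sum(c * c for c in remove_projection(p, basis)))

    return max(range(len(points)), key=lambda i: distance(points[i]))


points = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.1, 0.1, 0.9], [0.5, 0.5, 0.0]]
index = farthest_from_span(points, [points[0], points[1]])
```

In the cleanup phase this operation would be applied once per vertex, each time with the other $K - 1$ selected points as `selected`.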

The following lemma makes the intuitions rigorous and shows how cleanup improves the
guarantee of Lemma A.2.

Lemma A.5. Suppose $|S| = K - 1$ and each point in $S$ is $\delta = O(\epsilon/\gamma^2) < \gamma/20K$ close to some
vertex $v_i$, then the farthest point $v'_j$ found by the algorithm is $1 - O(\epsilon/\gamma)$ close to the remaining vertex.

Proof: We still look at the original point $a_j$ and express it as $\sum_{t=1}^K c_t v_t$. Without loss of generality
let $v_1$ be the vertex that does not correspond to anything in $S$. By Lemma A.1 $v_1$ is $\gamma/2$ far from
span($S$). On the other hand all other vertices are at least $\gamma/20r$ close to span($S$). We know the
distance $\mathrm{dis}(a_j, \mathrm{span}(S)) \ge \mathrm{dis}(v_1, \mathrm{span}(S)) - 2\epsilon$, and this cannot be true unless $c_1 \ge 1 - O(\epsilon/\gamma)$. ■

These lemmas directly lead to the following theorem:

Theorem A.6. The FastAnchorWords algorithm runs in time $\tilde{O}(V^2 + VK/\epsilon^2)$ and outputs a subset of
$\{d_1, \ldots, d_V\}$ of size $K$ that $O(\epsilon/\gamma)$-covers the vertices provided that $20K\epsilon/\gamma^2 < \gamma$.

Proof: In the first phase of the algorithm, do induction using Lemma A.2. When $20K\epsilon/\gamma^2 < \gamma$
Lemma A.2 shows that we find a set of points that $O(\epsilon/\gamma^2)$-covers the vertices. Now Lemma A.5
shows that after the cleanup phase the points are refined to $O(\epsilon/\gamma)$-cover the vertices. ■

B Proof for Nonnegative Recover Procedure 

In order to show RecoverL2 learns the parameters even when the rows of $Q$ are perturbed, we need
the following lemma that shows when columns of $Q$ are close to the expectation, the posteriors $c$
computed by the algorithm are also close to the true value.

Lemma B.1. For a $\gamma$ robust simplex $S$ with vertices $\{v_1, v_2, \ldots, v_K\}$, let $v$ be a point in the simplex
that can be represented as a convex combination $v = \sum_{i=1}^K c_i v_i$. If the vertices of $S$ are perturbed
to $S' = \{\ldots, v'_i, \ldots\}$ where $\|v'_i - v_i\| \le \delta_1$ and $v$ is perturbed to $v'$ where $\|v - v'\| \le \delta_2$. Let $v^*$
be the point in $S'$ that is closest to $v'$, and $v^* = \sum_{i=1}^K c'_i v'_i$, when $10\sqrt{K}\delta_1 \le \gamma$ for all $i \in [K]$
$|c_i - c'_i| \le 4(\delta_1 + \delta_2)/\gamma$.

Proof: Consider the point $u = \sum_{i=1}^K c_i v'_i$, by triangle inequality: $\|u - v\| \le \sum_{i=1}^K c_i \|v_i - v'_i\| \le \delta_1$.
Hence $\|u - v'\| \le \|u - v\| + \|v - v'\| \le \delta_1 + \delta_2$, and $u$ is in $S'$. The point $v^*$ is the point in $S'$ that
is closest to $v'$, so $\|v^* - v'\| \le \delta_1 + \delta_2$ and $\|v^* - u\| \le 2(\delta_1 + \delta_2)$.

Then we need to show when a point ($u$) moves a small distance, its representation also changes
by a small amount. Intuitively this is true because $S$ is $\gamma$ robust. By Lemma A.1 when $10\sqrt{K}\delta_1 < \gamma$
the simplex $S'$ is also $\gamma/2$ robust. For any $i$, let $\mathrm{Proj}_i(v^*)$ and $\mathrm{Proj}_i(u)$ be the projections of $v^*$
and $u$ in the orthogonal subspace of span($S' \setminus v'_i$), then

$$|c_i - c'_i| = \|\mathrm{Proj}_i(v^*) - \mathrm{Proj}_i(u)\| / \mathrm{dis}(v'_i, \mathrm{span}(S' \setminus v'_i)) \le 4(\delta_1 + \delta_2)/\gamma$$

and this completes the proof. ■

With this lemma it is not hard to show that RecoverL2 has polynomial sample complexity.

Theorem B.2. When the number of documents $M$ is at least

$$\max\{O(aK^3 \log V / D(\gamma p)^6 \epsilon),\; O((aK)^3 \log V / D\epsilon^3(\gamma p)^4)\}$$

our algorithm using the conjunction of FastAnchorWords and RecoverL2 learns the $A$ matrix with
entry-wise error at most $\epsilon$.

Proof: (sketch) We can assume without loss of generality that each word occurs with probability
at least $\epsilon/4aK$ and furthermore that if $M$ is at least $50 \log V / D\epsilon_Q^2$ then the empirical matrix $\tilde{Q}$ is
entry-wise within an additive $\epsilon_Q$ to the true $Q = \frac{1}{M}\sum_{d=1}^M A W_d W_d^T A^T$; see [Arora et al., 2012b] for
the details. Also, the $K$ anchor rows of $\bar{Q}$ form a simplex that is $\gamma p$ robust.

The error in each column of $\bar{Q}$ can be at most $\delta_2 = \epsilon_Q\sqrt{4aK/\epsilon}$. By Theorem A.6 when
$20K\delta_2/(\gamma p)^2 < \gamma p$ (which is satisfied when $M = O(aK^3 \log V / D(\gamma p)^6 \epsilon)$), the anchor words found
are $\delta_1 = O(\delta_2/(\gamma p))$ close to the true anchor words. Hence by Lemma B.1 every entry of $C$ has
error at most $O(\delta_2/(\gamma p)^2)$.

With such a number of documents, all the word probabilities $p(w = i)$ are estimated more
accurately than the entries of $C_{i,j}$, so we omit their perturbations here for simplicity. When we
apply the Bayes rule, we know $A_{i,k} = C_{i,k}\,p(w = i)/p(z = k)$, where $p(z = k)$ is $\alpha_k$ which is
lower bounded by $1/aK$. The numerator and denominator are all related to entries of $C$ with
positive coefficients that sum up to at most 1. Therefore the errors $\delta_{num}$ and $\delta_{denom}$ are at most the
error of a single entry of $C$, which is bounded by $O(\delta_2/(\gamma p)^2)$. Applying Taylor's Expansion to
$(p(z = k, w = i) + \delta_{num})/(\alpha_k + \delta_{denom})$, the error on entries of $A$ is at most $O(aK\delta_2/(\gamma p)^2)$. When
$\epsilon_Q \le O((\gamma p)^2 \epsilon^{1.5}/(aK)^{1.5})$, we have $O(aK\delta_2/(\gamma p)^2) \le \epsilon$, and get the desired accuracy of $A$. The
number of documents required is $M = O((aK)^3 \log V / D\epsilon^3(\gamma p)^4)$.

The sample complexity for $R$ can then be bounded using matrix perturbation theory. ■

C Empirical Results

This section contains plots for $\ell_1$ error, held-out probability, coherence, and uniqueness for all
semi-synthetic data sets. Up is better for all metrics except $\ell_1$ error.

C.1 Sample Topics

Tables 2, 3, and 4 show 100 topics trained on real NY Times articles using the RecoverL2
algorithm. Each topic is followed by the most similar topic (measured by $\ell_1$ distance) from a model
trained on the same documents with Gibbs sampling. When the anchor word is among the top
six words by probability it is highlighted in bold. Note that the anchor word is frequently not the
most prominent word.
Table 2: Example topic pairs from NY Times sorted by $\ell_1$ distance, anchor words in bold.

RecoverL2   run inning game hit season zzz_anaheim_angel
RecoverL2   king goal game team games season
Gibbs       point game team play season games
Gibbs       yard game season team play quarterback
RecoverL2   point game team season games play
Gibbs       [missing]
(unlabeled) [illegible] game [illegible] team
Gibbs       [missing]
RecoverL2   point game team season player zzz_clipper
Gibbs       ballot [illegible]
Gibbs       election ballot zzz_florida zzz_al_gore votes vote
RecoverL2   company billion companies percent million stock
Gibbs       [missing]
RecoverL2   car race team season driver point
(unlabeled) race car driver racing zzz_nascar team
RecoverL2   [missing]
Gibbs       season team baseball game player yankees
RecoverL2   Palestinian zzz_israeli zzz_israel official attack zzz_palestinian
Gibbs       Palestinian zzz_israeli zzz_israel attack zzz_palestinian zzz_yasser_arafat
RecoverL2   zzz_tiger_wood shot round player par play
(unlabeled) zzz_tiger_wood shot golf tour round player
RecoverL2   percent stock market companies fund quarter
(unlabeled) percent economy market stock economic growth
RecoverL2   zzz_al_gore zzz_bill_bradley campaign president zzz_george_bush vice
Gibbs       zzz_al_gore zzz_george_bush campaign presidential republican zzz_john_mccain
RecoverL2   zzz_george_bush zzz_john_mccain campaign republican zzz_republican voter
Gibbs       zzz_al_gore zzz_george_bush campaign presidential republican zzz_john_mccain
RecoverL2   net team season point player zzz_jason_kidd
Gibbs       point game team play season games
RecoverL2   yankees run team season inning hit
Gibbs       season team baseball game player yankees
RecoverL2   zzz_al_gore zzz_george_bush percent president campaign zzz_bush
Gibbs       zzz_al_gore zzz_george_bush campaign presidential [missing]
RecoverL2   zzz_enron company firm zzz_arthur_andersen companies lawyer
Gibbs       zzz_enron company firm accounting zzz_arthur_andersen financial
RecoverL2   team play game yard season player
Gibbs       yard game season team play quarterback
RecoverL2   film movie show director play character
RecoverL2   zzz_taliban zzz_afghanistan official zzz_u_s government military
Gibbs       zzz_taliban zzz_afghanistan zzz_pakistan afghan [missing]
RecoverL2   Palestinian zzz_israel israeli peace zzz_yasser_arafat leader
Gibbs       Palestinian zzz_israel peace israeli zzz_yasser_arafat leader
RecoverL2   point team game shot play zzz_celtic
RecoverL2   zzz_bush zzz_mccain campaign republican tax zzz_republican
Gibbs       zzz_al_gore zzz_george_bush campaign presidential republican zzz_john_mccain
RecoverL2   zzz_met run team game hit season
RecoverL2   team game season play games win
Gibbs       team coach game player season football
RecoverL2   government war zzz_slobodan_milosevic official court president
Gibbs       [missing]
RecoverL2   game set player zzz_pete_sampras play won
Gibbs       player game match team soccer play
RecoverL2   zzz_al_gore campaign zzz_bradley president democratic zzz_clinton
Gibbs       zzz_al_gore zzz_george_bush campaign presidential republican zzz_john_mccain
RecoverL2   team zzz_knick player season point play
Gibbs       point game team play season games
RecoverL2   com web www information sport question
Table 3: Example topic pairs from NY Times sorted by $\ell_1$ distance, anchor words in bold.

RecoverL2   season team game coach play school
RecoverL2   air shower rain wind storm front
Gibbs       [missing]
RecoverL2   book film beginitalic enditalic look movie
Gibbs       [missing]
RecoverL2   zzz_al_gore campaign election zzz_george_bush zzz_florida president
Gibbs       [missing] republican zzz_john_mccain
RecoverL2   race won horse zzz_kentucky_derby win winner
RecoverL2   company companies zzz_at percent business stock
RecoverL2   company million companies percent business customer
Gibbs       company companies business industry firm market
RecoverL2   team coach season player jet job
Gibbs       team player million season contract agent
RecoverL2   season team game play player zzz_cowboy
(unlabeled) yard game season team play quarterback
RecoverL2   zzz_pakistan zzz_india official group attack zzz_united_states
Gibbs       zzz_taliban zzz_afghanistan zzz_pakistan afghan zzz_india government
RecoverL2   show network night television zzz_nbc program
Gibbs       [missing]
RecoverL2   com information question zzz_eastern commentary
Gibbs       com question information zzz_eastern daily commentary
RecoverL2   power plant company percent million energy
(unlabeled) oil power energy gas prices plant
RecoverL2   cell stem research zzz_bush human patient
Gibbs       cell research human scientist stem genes
RecoverL2   zzz_governor_bush zzz_al_gore campaign tax president plan
Gibbs       zzz_al_gore zzz_george_bush campaign presidential republican zzz_john_mccain
RecoverL2   cup minutes add tablespoon water oil
Gibbs       cup minutes add tablespoon teaspoon oil
(unlabeled) film movie character play minutes hour
Gibbs       [missing]
RecoverL2   zzz_china Chinese zzz_united_states zzz_taiwan [missing]
Gibbs       zzz_china Chinese zzz_beijing zzz_taiwan government
(unlabeled) [illegible] court law case lawyer zzz_texas
RecoverL2   [missing]
Gibbs       trial death prison case [illegible]
RecoverL2   company percent million sales business companies
Gibbs       company companies business industry firm market
RecoverL2   dog jump show quick brown fox
Gibbs       [missing]
RecoverL2   shark play team attack water game
RecoverL2   anthrax official mail letter worker attack
RecoverL2   president zzz_clinton zzz_white_house zzz_bush official zzz_bill_clinton
Gibbs       zzz_bush zzz_george_bush president administration zzz_white_house zzz_dick_cheney
RecoverL2   father family zzz_elian boy court zzz_miami
Gibbs       zzz_cuba zzz_miami cuban zzz_elian boy protest
RecoverL2   oil prices percent million market zzz_united_states
RecoverL2   zzz_microsoft company computer system window software
Gibbs       zzz_microsoft company companies cable zzz_at zzz_internet
RecoverL2   government election zzz_mexico political zzz_vicente_fox president
(unlabeled) election political campaign zzz_party democratic voter
RecoverL2   fight zzz_mike_tyson round right million champion
RecoverL2   right law president zzz_george_bush zzz_senate zzz_john_ashcroft
Gibbs       election political campaign zzz_party democratic voter
RecoverL2   com home look found show www
Gibbs       [missing]
RecoverL2   car driver race zzz_dale_earnhardt racing zzz_nascar
Gibbs       night hour room hand told morning
RecoverL2   book women family called author woman
Table 4: Example topic pairs from NY Times sorted by $\ell_1$ distance, anchor words in bold.

RecoverL2   tax bill zzz_senate billion plan zzz_bush
Gibbs       bill zzz_senate zzz_congress zzz_house legislation zzz_white_house
(unlabeled) company francisco san com food home
Gibbs       palm beach com statesman daily american
RecoverL2   team player season game [illegible] right
RecoverL2   zzz_bush official zzz_united_states zzz_u_s president zzz_north_korea
Gibbs       zzz_united_states weapon zzz_iraq nuclear zzz_russia zzz_bush
RecoverL2   zzz_russian zzz_russia official military war attack
Gibbs       government war country rebel leader military
RecoverL2   wine wines percent zzz_new_york com show
Gibbs       film movie character play minutes hour
RecoverL2   police zzz_ray_lewis player team case told
Gibbs       [missing]
RecoverL2   government group political tax leader money
Gibbs       [missing]
RecoverL2   percent company million airline flight deal
Gibbs       flight airport passenger airline security airlines
RecoverL2   book ages children school boy web
Gibbs       palm beach com statesman daily american
Gibbs       zzz_olympic games medal gold team sport
RecoverL2   priest church official abuse bishop sexual
Gibbs       church religious priest zzz_god religion bishop
RecoverL2   human drug company companies million scientist
Gibbs       scientist light science planet called space
RecoverL2   music zzz_napster company song com web
RecoverL2   zzz_timothy_mcveigh federal official [missing]
Gibbs       trial death prison case lawyer prosecutor
RecoverL2   [missing]
RecoverL2   buy panelist thought flavor product ounces
Gibbs       food restaurant chef dinner eat meal
RecoverL2   school student program teacher public children
Gibbs       [missing]
(unlabeled) security [illegible] federal bill
Gibbs       flight airport passenger airline security airlines
RecoverL2   company member credit card money mean
Gibbs       zzz_enron company firm accounting zzz_arthur_andersen financial
RecoverL2   million percent bond tax debt bill
Gibbs       million program billion money government federal
RecoverL2   million company zzz_new_york business art percent
Gibbs       art artist painting museum show collection
RecoverL2   percent million number official group black
Gibbs       palm beach com statesman daily american
RecoverL2   company tires million car zzz_ford percent
Gibbs       [missing]
RecoverL2   article zzz_new_york misstated company percent com
RecoverL2   company million percent companies government official
RecoverL2   official million train car system plan
Gibbs       million program billion money government federal
RecoverL2   test student school look percent system
Gibbs       patient doctor cancer medical hospital surgery
RecoverL2   con una mas dice las anos
Gibbs       fax syndicate article com information con
RecoverL2   por con una mas millones como
Gibbs       fax syndicate article com information con
RecoverL2   las como zzz_latin_trade articulo telefono fax
RecoverL2   los con articulos telefono representantes zzz_america_latina
Gibbs       fax syndicate article com information con
RecoverL2   file sport read internet email zzz_los_angeles
Figure 9: Results for a semi-synthetic model generated from a model trained on NY Times articles
with K = 100.

Figure 10: Results for a semi-synthetic model generated from a model trained on NY Times
articles with K = 100, with a synthetic anchor word added to each topic.



D Algorithmic Details 

D.1 Generating Q matrix

For each document, let $H_d$ be the vector in $\mathbb{R}^V$ such that the i-th entry is the number of times
word i appears in document d, $n_d$ be the length of the document and $W_d$ be the topic vector chosen
according to the Dirichlet distribution when the documents are generated. Conditioned on the $W_d$'s, our
algorithms require the expectation of $Q$ to be $\frac{1}{M}\sum_{d=1}^M A W_d W_d^T A^T$.

In order to achieve this, similar to [Anandkumar et al., 2012], let the normalized vector

$$\tilde{H}_d = \frac{H_d}{\sqrt{n_d(n_d - 1)}}$$

and diagonal matrix $\hat{H}_d = \frac{\mathrm{Diag}(H_d)}{n_d(n_d - 1)}$. Compute the matrix

$$\tilde{H}_d \tilde{H}_d^T - \hat{H}_d = \frac{1}{n_d(n_d - 1)} \sum_{i \ne j,\; i,j \in [n_d]} e_{z_{d,i}} e_{z_{d,j}}^T.$$

Here $z_{d,i}$ is the i-th word of document d, and $e_i \in \mathbb{R}^V$ is the basis vector. From the generative model,
the expectation of all terms $e_{z_{d,i}} e_{z_{d,j}}^T$ are equal to $A W_d W_d^T A^T$, hence by linearity of expectation we
know $\mathbb{E}[\tilde{H}_d \tilde{H}_d^T - \hat{H}_d] = A W_d W_d^T A^T$.

Figure 11: Results for a semi-synthetic model generated from a model trained on NY Times
articles with K = 100, with moderate correlation between topics.

[Figure 12 plots: SynthNYT, strong corr, Coherence; SynthNYT, strong corr, Unique Words]

Figure 12: Results for a semi-synthetic model generated from a model trained on NY Times
articles with K = 100, with stronger correlation between topics.

If we collect all the column vectors $\tilde{H}_d$ to form a large sparse matrix $\tilde{H}$, and compute the sum of
all $\hat{H}_d$ to get the diagonal matrix $\hat{H}$, we know $Q = \tilde{H}\tilde{H}^T - \hat{H}$ has the desired expectation. The
running time of this step is $O(MD^2)$ where $D^2$ is the expectation of the length of the document squared.
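The construction above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: it uses dense nested lists instead of sparse matrices, represents each document as a list of word ids, skips documents with fewer than two tokens, and divides by the number of documents used so that the result matches the $\frac{1}{M}\sum_d$ form of the stated expectation. The function name is made up for this sketch.

```python
def cooccurrence_matrix(documents, vocab_size):
    """Average over documents of H~_d H~_d^T - H^_d, as a V x V nested list.

    `documents` is a list of documents, each a list of integer word ids.
    Dividing by the number of documents used is an assumption of this sketch.
    """
    q = [[0.0] * vocab_size for _ in range(vocab_size)]
    used = 0
    for doc in documents:
        n = len(doc)
        if n < 2:
            continue  # a single token yields no ordered word pairs
        counts = {}
        for word in doc:
            counts[word] = counts.get(word, 0) + 1
        norm = n * (n - 1)
        for i, count_i in counts.items():
            for j, count_j in counts.items():
                q[i][j] += count_i * count_j / norm  # H_d H_d^T term
            q[i][i] -= count_i / norm  # subtract Diag(H_d): removes i = j pairs
        used += 1
    if used == 0:
        raise ValueError("need at least one document with two or more tokens")
    return [[value / used for value in row] for row in q]


docs = [[0, 1, 1], [2, 0]]
Q = cooccurrence_matrix(docs, 3)
```

Each document contributes a matrix whose entries sum to one, so the entries of the averaged matrix also sum to one.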

D.2 Exponentiated gradient algorithm 

The optimization problem that arises in RecoverKL and RecoverL2 has the following form,

minimize $d(b, Tx)$
subject to: $x \ge 0$ and $x^T \mathbf{1} = 1$

where $d(\cdot, \cdot)$ is a Bregman divergence, $x$ is a vector of length $K$, and $T$ is a matrix of size $V \times K$. We
solve this optimization problem using the Exponentiated Gradient algorithm [Kivinen & Warmuth,
1995], described in Algorithm 5. In our experiments we show results using both squared Euclidean
distance and KL divergence for the divergence measure. Stepsizes are chosen with a line search
to find an $\eta$ that satisfies the Wolfe and Armijo conditions (for details, see Nocedal & Wright
[2006]). We test for convergence using the KKT conditions. Writing the KKT conditions for our
constrained minimization problem:

1. Stationarity: $\nabla_x d(b, Tx^*) - \lambda + \mu \mathbf{1} = 0$

2. Primal Feasibility: $x^* \ge 0$, $|x^*|_1 = 1$

3. Dual Feasibility: $\lambda \ge 0$

4. Complementary Slackness: $\lambda_i x^*_i = 0$

For every iterate of $x$ generated by Exponentiated Gradient, we set $\lambda, \mu$ to satisfy conditions 1-3.
This gives the following equations:

$\lambda = \nabla_x d(b, Tx^*) + \mu \mathbf{1}$
$\mu = -(\nabla_x d(b, Tx^*))_{\min}$

By construction conditions 1-3 are satisfied (note that the multiplicative update and the projection
step ensure that $x$ is always primal feasible). Convergence is tested by checking whether the final
KKT condition holds within some tolerance. Since $\lambda$ and $x$ are nonnegative, we check
complementary slackness by testing whether $\lambda^T x < \epsilon$. This convergence test can also be thought of as
testing the value of the primal-dual gap, since the Lagrangian function has the form:
$L(x, \lambda, \mu) = d(b, Tx) - \lambda^T x + \mu(x^T \mathbf{1} - 1)$, and $(x^T \mathbf{1} - 1)$ is zero at every iteration.

[Figure 13 plot: SynthNIPS, L1 error]

Figure 13: Results for a semi-synthetic model generated from a model trained on NIPS papers
with K = 100. For $D \in \{2000, 6000, 8000\}$, Recover produces log probabilities of $-\infty$ for some
held-out documents.

The running time of RecoverL2 is the time of solving $V$ small ($K \times K$) quadratic programs.
In particular, when using Exponentiated Gradient to solve the quadratic program, each word requires
$O(KV)$ time for preprocessing and $O(K^2)$ per iteration. The total running time is $O(KV^2 + K^2VT)$
where $T$ is the average number of iterations. The value of $T$ is about 100 to 1000 depending on the
data set.
Algorithm 5. Exponentiated Gradient

Input: Matrix $T$, vector $b$, divergence measure $d(\cdot, \cdot)$, tolerance parameter $\epsilon$
Output: non-negative normalized vector $x$ close to $x^*$, the minimizer of $d(b, Tx)$
Initialize $x \leftarrow \frac{1}{K}\mathbf{1}$
Initialize Converged $\leftarrow$ False
while not Converged do
    $p = \nabla d(b, Tx)$
    Choose a step size $\eta_t$
    $x \leftarrow x e^{-\eta_t p}$ (Gradient step)
    $x \leftarrow x / |x|_1$ (Projection onto the simplex)
    $\mu \leftarrow \nabla d(b, Tx)_{\min}$
    $\lambda \leftarrow \nabla d(b, Tx) - \mu$
    Converged $\leftarrow \lambda^T x < \epsilon$
end while
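A minimal sketch of Algorithm 5 for the squared Euclidean case follows. It is an illustration, not the authors' code: the function name, the fixed `step_size` (the paper uses a line search satisfying the Wolfe and Armijo conditions), and the `max_iter` cap are assumptions of this sketch, and plain Python lists stand in for matrices.

```python
import math


def exponentiated_gradient_l2(T, b, step_size=1.0, tol=1e-8, max_iter=10000):
    """Minimize ||b - T x||^2 over the probability simplex.

    T is a list of V rows with K entries each; b is a list of length V.
    A fixed step size replaces the line search (an assumption of this sketch).
    """
    num_rows, num_cols = len(T), len(T[0])
    x = [1.0 / num_cols] * num_cols  # start at the center of the simplex

    def gradient(point):
        residual = [
            sum(T[v][k] * point[k] for k in range(num_cols)) - b[v]
            for v in range(num_rows)
        ]
        return [
            2.0 * sum(T[v][k] * residual[v] for v in range(num_rows))
            for k in range(num_cols)
        ]

    for _ in range(max_iter):
        grad = gradient(x)
        # Multiplicative update. Subtracting min(grad) rescales every entry by
        # the same factor, so it cancels in the normalization below and only
        # guards against overflow in exp.
        shift = min(grad)
        x = [xk * math.exp(-step_size * (g - shift)) for xk, g in zip(x, grad)]
        total = sum(x)
        x = [xk / total for xk in x]  # renormalize onto the simplex
        # KKT-based test: lambda = grad - min(grad) >= 0, check lambda^T x < tol.
        grad = gradient(x)
        mu = min(grad)
        slackness = sum((g - mu) * xk for g, xk in zip(grad, x))
        if slackness < tol:
            break
    return x


T = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b = [0.3, 0.7, 0.5]  # equals T applied to x = [0.3, 0.7]
x = exponentiated_gradient_l2(T, b)
```

In this small example the minimizer is `[0.3, 0.7]`. A fixed step size may fail to converge on harder problems, which is presumably why the paper uses a line search.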